fix: use sync.Cond to handle no-task blocking wait #299
Conversation
case <-tq.noTaskSignal:
}
tq.noTaskCond.L.Lock()
for tq.activeTasks > 0 {
I know this looks unnecessary, but my assumption here is that there are no guarantees as to who gets the mutex next when multiple parties are waiting on it, both via Wait() and via Lock(). So if a second task started and got the lock before our wait caller was woken up, we could find ourselves in a tq.activeTasks > 0 condition again even after waking up here.
At least that's how it would work in C++ and Java. Or do we have stronger guarantees in Go about who gets the lock next after a Broadcast()?
I think the loop is right - the sync.Cond.Wait docs use a loop as an example too.
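For reference, here's a minimal sketch of the wait-in-a-loop pattern the sync.Cond.Wait documentation recommends, written against the field names visible in the diff above (the surrounding type is an assumption for illustration, not the actual code in this PR):

```go
// Minimal sketch only: the taskQueue type below is assumed from the field
// names in the diff, not copied from the real implementation.
package taskqueue

import "sync"

type taskQueue struct {
	noTaskCond  *sync.Cond // noTaskCond.L guards activeTasks
	activeTasks int
}

// WaitForNoActiveTasks blocks until there are zero active tasks.
func (tq *taskQueue) WaitForNoActiveTasks() {
	tq.noTaskCond.L.Lock()
	for tq.activeTasks > 0 {
		// Wait atomically unlocks L and suspends this goroutine; it re-locks L
		// before returning. Another goroutine may have taken the lock (and
		// started a task) before we were rescheduled, so the condition has to
		// be re-checked, hence the loop.
		tq.noTaskCond.Wait()
	}
	tq.noTaskCond.L.Unlock()
}
```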
sync.Cond is actually what I was thinking about suggesting if you wanted to make this more flexible :)
Note that there's also sync.WaitGroup, but it's designed for a static pattern where you Add tasks up front and Wait afterwards. It may panic if you call Add after Wait, which we may do here. It's a very close use case, but not applicable.
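For contrast, the static pattern sync.WaitGroup is designed for looks like this (generic example, nothing from this repo):

```go
// Generic sync.WaitGroup usage: every Add happens before Wait starts.
// Calling Add concurrently with (or after) Wait is the misuse mentioned
// above and can panic.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1) // counted up front, before Wait begins
		go func(n int) {
			defer wg.Done()
			fmt.Println("task", n, "done")
		}(i)
	}
	wg.Wait() // no further Add calls race with this Wait
}
```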
}
tq.noTaskCond.L.Unlock()
sync.Cond.Broadcast doesn't require holding the lock to be called, so I wonder if you could make this loop lock-free by also continuing to use sync/atomic for activeTasks.
I think that's ideally what we want, because otherwise we're adding some amount of lock contention on each worker. The locks are only held for tiny amounts of time here, but they still add their overhead, whereas atomics are comparatively very cheap.
In other words, I think we can just use sync.Cond for its broadcast feature, and continue using atomics for the counter. If our condition was more complex and we couldn't use atomics, then we'd need the shared lock for sure, but I don't think we absolutely need it here.
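A sketch of the idea being proposed here (type and method names are assumed for illustration; as the rest of this thread works out, dropping the lock on the broadcast side introduces a race, so treat this as illustrative only):

```go
// Sketch of the atomics-plus-broadcast idea. The taskQueue type and method
// names are assumptions; the thread below identifies a race in this
// lock-free variant.
package taskqueue

import (
	"sync"
	"sync/atomic"
)

type taskQueue struct {
	activeTasks int64 // updated via sync/atomic
	noTaskCond  *sync.Cond
}

// Worker side: no lock held around the decrement or the broadcast.
func (tq *taskQueue) taskDone() {
	if atomic.AddInt64(&tq.activeTasks, -1) == 0 {
		tq.noTaskCond.Broadcast() // legal without holding L, per the docs
	}
}

// Waiter side: the lock is still needed for Cond.Wait itself.
func (tq *taskQueue) WaitForNoActiveTasks() {
	tq.noTaskCond.L.Lock()
	for atomic.LoadInt64(&tq.activeTasks) > 0 {
		// A broadcast sent between the check above and this Wait is lost.
		tq.noTaskCond.Wait()
	}
	tq.noTaskCond.L.Unlock()
}
```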
Careful: if a sync.Cond.Broadcast() is issued and there isn't anyone Wait()ing to "hear" it, the broadcast is discarded. You effectively need at least an implicit "lock" (by some other means) to ensure that something is already in Wait() when the broadcast fires.
I guess this depends on when the Wait call happens. I assume we call Wait before the workers all go idle. If we may call Wait at a later time, then indeed we need something more. Perhaps Wait should first atomically check if we're already idle, and if so, return immediately. Otherwise, block until the next broadcast.
Wait, the current code already does this - it only Waits inside the loop after it checks that we're not already idle.
I think you're saying that replacing the lock with atomics on the broadcast side could add a race, like the following sequence of events:
- WaitForNoActiveTasks grabs the lock
- WaitForNoActiveTasks sees the counter is 1, so it enters the loop
- worker finishes a task, and broadcasts without grabbing the lock
- WaitForNoActiveTasks calls Cond.Wait, which doesn't see the broadcast as it is late
Whereas with Rod's current code:
- WaitForNoActiveTasks grabs the lock
- WaitForNoActiveTasks sees the counter is 1, so it enters the loop
- worker finishes a task, and tries to grab the lock to broadcast - blocking, as WaitForNoActiveTasks has the lock
- WaitForNoActiveTasks calls Cond.Wait, temporarily releasing the lock
- worker grabs the lock and broadcasts
- WaitForNoActiveTasks sees the broadcast and finishes
So I think you're right: we don't need to hold the lock to broadcast, like the docs say, but not doing so introduces a logic race.
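To make the difference concrete, here's a sketch of the worker-side decrement in that second sequence, where the broadcast happens under the shared lock (field and method names are assumed, not copied from the PR):

```go
// Sketch of the lock-held variant described above. Because the worker must
// take the same lock the waiter holds while checking the counter, the
// broadcast cannot land in the gap between the waiter's check and its
// Cond.Wait call.
package taskqueue

import "sync"

type taskQueue struct {
	noTaskCond  *sync.Cond // noTaskCond.L guards activeTasks
	activeTasks int
}

func (tq *taskQueue) taskDone() {
	tq.noTaskCond.L.Lock()
	tq.activeTasks--
	if tq.activeTasks == 0 {
		tq.noTaskCond.Broadcast()
	}
	tq.noTaskCond.L.Unlock()
}
```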
Yup! This is what I was trying to convey, sorry for being terse!
SGTM - as discussed in the thread, I don't think we can make this lock-free on the worker side.
LGTM
feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests (#284)
* feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests
* fixup! feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests

fix(responsemanager): fix flaky tests

fix(responsemanager): make fix more global

feat: add basic OT tracing for incoming requests
Closes: #271

docs(tests): document tracing test helper utilities

fix(test): increase 1s timeouts to 2s for slow CI (#289)
* fix(test): increase 1s timeouts to 2s for slow CI
* fixup! fix(test): increase 1s timeouts to 2s for slow CI

testutil/chaintypes: simplify maintenance of codegen (#294)
"go generate" now updates the generated code for us. The separate directory for a main package was unnecessary; a build-tag-ignored file is enough. Using gofmt on the resulting source is now unnecessary too, as upstream has been using go/format on its output for some time. Finally, re-generate the output source code, as the last time that was done we were on an older ipld-prime.

ipldutil: use chooser APIs from dagpb and basicnode (#292)
Saves us a bit of extra code, since they were added in summer. Also avoid making defaultVisitor a variable, which makes it clearer that it's never a nil func. While here, replace node/basic with node/basicnode, as the former has been deprecated in favor of the latter.
Co-authored-by: Hannah Howard <hannah@hannahhoward.net>

fix: use sync.Cond to handle no-task blocking wait (#299)
Ref: #284

Peer Stats function (#298)
* feat(graphsync): add impl method for peer stats, which gets current request states by request ID for a given peer
* fix(requestmanager): fix tested method

Add a bit of logging (#301)
* chore(responsemanager): add a bit of logging
* fix(responsemanager): remove code change

chore: short-circuit unnecessary message processing

Expose task queue diagnostics (#302)
* feat(impl): expose task queue diagnostics
* refactor(peerstate): put peerstate in its own module
* refactor(peerstate): make diagnostics return array
Based on feedback from #284, I think this is closer to what I want. While we currently don't have a use for more than one consumer, I want WaitForNoActiveTasks() to be a general, and potentially (or at least safely) multi-consumer, blocking call that doesn't return until there are zero active tasks.
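To illustrate the intended multi-consumer behaviour, here's a hypothetical, self-contained sketch. The only name taken from this PR is WaitForNoActiveTasks; the interface, the fake queue, and the helper are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// noActiveTasksWaiter is a stand-in for the queue in this PR: anything with a
// blocking WaitForNoActiveTasks method fits. The interface itself is
// hypothetical, defined only for this sketch.
type noActiveTasksWaiter interface {
	WaitForNoActiveTasks()
}

// idleQueue is a trivial fake whose WaitForNoActiveTasks returns immediately,
// so the sketch runs on its own.
type idleQueue struct{}

func (idleQueue) WaitForNoActiveTasks() {}

// waitAllConsumers shows the intended behaviour: several goroutines block on
// WaitForNoActiveTasks concurrently, and all of them return once the queue
// reports zero active tasks.
func waitAllConsumers(tq noActiveTasksWaiter, consumers int) {
	var wg sync.WaitGroup
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			tq.WaitForNoActiveTasks() // safe to call from multiple goroutines
			fmt.Println("consumer", n, "saw zero active tasks")
		}(i)
	}
	wg.Wait()
}

func main() {
	waitAllConsumers(idleQueue{}, 3)
}
```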